Predictive Analytics in R

David O'Brien dunder.chief@gmail.com

August 25, 2015

What is Predictive Modeling?


  1. Given a set of predictor variables (X)

  2. Predict an outcome (Y)

Our Flower!


What kind of iris is this?


Our guess:

Sepal Length [X1]  Sepal Width [X2]  Petal Length [X3]  Petal Width [X4]  Species [Y]
              6.5               2.8                4.6               1.5  ???

Sepal Length [X1]  Sepal Width [X2]  Petal Length [X3]  Petal Width [X4]  Species [Y]
              5.1               3.5                1.4               0.2  setosa
              4.9               3.0                1.4               0.2  setosa
              4.7               3.2                1.3               0.2  setosa
              7.0               3.2                4.7               1.4  versicolor
              6.4               3.2                4.5               1.5  versicolor
              6.9               3.1                4.9               1.5  versicolor
              6.3               3.3                6.0               2.5  virginica
              5.8               2.7                5.1               1.9  virginica
              7.1               3.0                5.9               2.1  virginica

            setosa  versicolor  virginica
Probability      0       0.995      0.005

Versicolor!

Implementation in R:


library(MASS)
our_flower <- 55                    # row index of the flower we want to classify
trainset <- iris[-our_flower, ]     # all remaining rows are training data
fit.lda <- lda(Species ~ ., data=trainset, prior=c(1/3, 1/3, 1/3))
pred <- predict(fit.lda, newdata=iris[our_flower, ])
round(pred$posterior, 3)
##    setosa versicolor virginica
## 55      0      0.995     0.005


Machine Learning Basics



library(caret)
trainIndex <- createDataPartition(iris$Species, p = .5,
                                  list = FALSE,
                                  times = 1)
irisTrain <- iris[ trainIndex, ]
irisTest  <- iris[-trainIndex, ]
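
createDataPartition() samples within each outcome class, so the split stays stratified. The same idea can be sketched in base R (the seed and the 50/50 proportion are assumptions for illustration):

```r
# Stratified 50/50 split by hand: sample half the row indices within each class
set.seed(3)
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(i) sample(i, length(i) %/% 2)))
irisTrain <- iris[idx, ]
irisTest  <- iris[-idx, ]
table(irisTrain$Species)  # 25 of each species
```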


Model Tuning

\[y = \sin(2x)\]

x <- seq(pi, 5, by=.05)
y <- sin(2*x)

[Plot: the true curve y = sin(2x)]

set.seed(1)
error <- rnorm(length(x), sd=.5)
dat <- data.frame(X = x , Y = y + error)

[Plot: noisy observations, Y = sin(2X) + error]

Tuning polynomials

set.seed(100)
trainIndex <- createDataPartition(y=dat$Y, p=0.5, list=FALSE)

training <- dat[trainIndex, ]
test <- dat[-trainIndex, ]


\[y=\theta_3x^3 + \theta_2x^2 + \theta_1x + \theta_0\]


fit <- lm(Y ~ poly(X, 3, raw=TRUE), data=training)
pred.training <- predict(fit, newdata=training)
pred.test <- predict(fit, newdata=test)

[Plots: model fit in-sample (training set) | out-of-sample (test set)]
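
The in-sample vs. out-of-sample comparison can be made concrete by computing the root-mean-square error on both sets. A self-contained sketch, regenerating the data from the earlier slides and using a plain random split in place of createDataPartition() so it needs no packages:

```r
set.seed(1)
x <- seq(pi, 5, by = .05)
dat <- data.frame(X = x, Y = sin(2 * x) + rnorm(length(x), sd = .5))

set.seed(100)
idx <- sample(seq_len(nrow(dat)), size = floor(0.5 * nrow(dat)))
training <- dat[idx, ]
test     <- dat[-idx, ]

fit <- lm(Y ~ poly(X, 3, raw = TRUE), data = training)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(in_sample  = rmse(training$Y, predict(fit, training)),
  out_sample = rmse(test$Y, predict(fit, test)))
```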

Parsimony / Occam's Razor

The simplest model is usually the best.

Data Splitting


  1. Training set [60%]:
    Train a model 100x with different tuning parameters

  2. Cross-validation set [20%]:
    Evaluate these 100 models

  3. Test set [20%]:
    Use the final model (only one!) to evaluate the accuracy of your analysis
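
A base-R sketch of this 60/20/20 split on iris (the seed is an assumption):

```r
set.seed(42)
n   <- nrow(iris)
idx <- sample(seq_len(n))                         # shuffled row indices
n_train <- round(0.6 * n)                         # 60% -> training
n_cv    <- round(0.2 * n)                         # 20% -> cross-validation
train_idx <- idx[1:n_train]
cv_idx    <- idx[(n_train + 1):(n_train + n_cv)]
test_idx  <- idx[(n_train + n_cv + 1):n]          # remaining 20% -> test
lengths(list(train = train_idx, cv = cv_idx, test = test_idx))
```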

40% of data on testing?!?


k-fold cross-validation
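
Instead of holding 40% out once, k-fold cross-validation rotates the held-out fold so every observation is tested exactly once. A minimal base-R sketch with k = 5 on a toy regression (the dataset and model are assumptions for illustration):

```r
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign each row a fold
cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt, data = mtcars[folds != i, ])   # train on the other k-1 folds
  held <- mtcars[folds == i, ]
  mean((held$mpg - predict(fit, held))^2)             # error on the held-out fold
})
mean(cv_mse)  # cross-validated estimate of out-of-sample MSE
```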


Typical flow for trying a new algorithm:

  1. Find the package(s) and install
  2. Find training function
  3. Split data into multiple train/test sets
  4. Set up your data to fit the training model
    • Formula
    • Matrix
    • Data.frame
    • X, Y as separate objects
  5. Pre-process data
  6. Look up tuning params
  7. Write loops for model tuning / repeated cross-validation
  8. Analyze results

predict(fitObject, type = ???)


Model       Probability
gbm         "response"
mda         "posterior"
rpart       "prob"
Weka        "probability"
LogitBoost  "raw"
lda         None needed
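
For example, with rpart (a recommended package that ships with R) the table says to request class probabilities with type = "prob":

```r
library(rpart)
fit <- rpart(Species ~ ., data = iris)
# One probability per class for each requested row; rows sum to 1
probs <- predict(fit, iris[c(1, 51, 101), ], type = "prob")
probs
```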

Caret

Website: https://topepo.github.io/caret/index.html

List of Models: https://topepo.github.io/caret/modelList.html


options(stringsAsFactors=FALSE)
models <- read.csv('../caret_models.csv')
table(models$Type)
## 
## Classification       Dual Use     Regression 
##             74             73             45
class_models <- subset(models, Type %in% c('Classification', 'Dual Use'),
                       select='method.Argument')

Train lots of models at once


library(caret); library(doMC); registerDoMC(7)
# class_models is a one-column data.frame, so iterate over its column;
# %dopar% (not %do%) actually uses the 7 registered cores
myFits <- foreach(this.model = class_models$method.Argument) %dopar% {
  train(Species ~ ., 
        data=iris,
        method=this.model,
        preProcess='pca',
        trControl=trainControl(method='repeatedcv', number=5, repeats=7),
        tuneLength=5)
}



caret: Basic Syntax

train(Species ~ ., 
        data=iris,
        method='gbm',
        preProcess='knnImpute',
        trControl=trainControl(method='repeatedcv', number=5, repeats=7),
        tuneLength=5)

train():

  • method: our machine learning algorithm (select from 192)
  • preProcess:
PreProcess_Option
BoxCox
YeoJohnson
expoTrans
center
scale
range
knnImpute
bagImpute
medianImpute
pca
ica
spatialSign

trainControl():

Resampling_Method
boot
boot632
cv
repeatedcv
LOOCV
LGOCV
none
oob
adaptive_cv
adaptive_boot
adaptive_LGOCV

Adding custom tuning params

gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9),
                        n.trees = (1:30)*50,
                        shrinkage = 0.1,
                        n.minobsinnode = 20)
head(gbmGrid)
##   interaction.depth n.trees shrinkage n.minobsinnode
## 1                 1      50       0.1             20
## 2                 5      50       0.1             20
## 3                 9      50       0.1             20
## 4                 1     100       0.1             20
## 5                 5     100       0.1             20
## 6                 9     100       0.1             20
train(Species ~ ., 
      data=iris,
      method='gbm',
      preProcess='pca',
      trControl=trainControl(method='repeatedcv', number=5, repeats=7),
      tuneGrid = gbmGrid)

Adaptive Resampling

Speed up the optimization process

fitControl2 <- trainControl(method = "adaptive_cv",
                            number = 10,
                            repeats = 5,
                            ## Estimate class probabilities
                            classProbs = TRUE,
                            ## Evaluate performance using 
                            ## the following function
                            summaryFunction = twoClassSummary,
                            ## Adaptive resampling information:
                            adaptive = list(min = 10,
                                            alpha = 0.05,
                                            method = "gls",
                                            complete = TRUE))

ref: http://arxiv.org/abs/1405.6974

Data Splitting (Time Series)


Time Series

library(quantmod)
gold <- getSymbols('GLD', src='yahoo', from='1970-01-01', auto.assign=FALSE)
library(caret)
slices <- createTimeSlices(Cl(gold), initialWindow=1000, 
                           fixedWindow=TRUE, horizon=500, skip=500)
str(slices)
## List of 2
##  $ train:List of 3
##   ..$ Training0001: int [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ Training0502: int [1:1000] 502 503 504 505 506 507 508 509 510 511 ...
##   ..$ Training1003: int [1:1000] 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 ...
##  $ test :List of 3
##   ..$ Testing0001: int [1:500] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##   ..$ Testing0502: int [1:500] 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 ...
##   ..$ Testing1003: int [1:500] 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 ...

Data Splitting | Class imbalances

imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test  <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
## 
## Class1 Class2 
##   9452    548
upSample()
downSample()
ROSE()
SMOTE()
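
A base-R sketch of the idea behind downSample(): randomly subsample every class down to the minority-class size (caret's version also shuffles the rows and standardizes the outcome column name; the toy data here is an assumption):

```r
set.seed(7)
df <- data.frame(x = rnorm(100),
                 Class = factor(c(rep("Class1", 95), rep("Class2", 5))))
n_min <- min(table(df$Class))                 # minority class size
down  <- do.call(rbind, lapply(split(df, df$Class),
                               function(d) d[sample(nrow(d), n_min), ]))
table(down$Class)  # both classes now have n_min rows
```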

http://topepo.github.io/caret/sampling.html

Pre-processing


dummyVars()

nearZeroVar()

findCorrelation()

findLinearCombos()

classDist()
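
As an example of what findCorrelation() targets, a base-R sketch flags one predictor from each highly correlated pair (caret's algorithm is more careful about which member of a pair to drop; the 0.9 cutoff is an assumption):

```r
m <- cor(iris[, 1:4])                              # predictor correlation matrix
high <- which(abs(m) > 0.9 & upper.tri(m), arr.ind = TRUE)
drop_candidates <- unique(colnames(m)[high[, "col"]])
drop_candidates  # Petal.Length and Petal.Width correlate at ~0.96
```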

Variable Importance



varImp()

Feature Selection

Recursive Feature Elimination:

rfe() 
rfeControl()

Genetic Algorithms:

gafs()
gafsControl()

Univariate Filters:

sbf()
sbfControl()

Simulated Annealing:

safs()
safsControl()

h2o package:

http://h2o.ai/

  • Open Source Java library

  • Runs Single model over multiple nodes

  • Hadoop & Spark

  • EC2 / Azure / Compute Engine

  • R, Scala, Python, Web Browser, REST API

  • Run locally

Models available with H2O


  • K-Means
  • GLM
  • DRF
  • Naïve Bayes
  • PCA
  • GBM
  • Deep Learning

h2o.deeplearning(
   x, y, training_frame, model_id = "",
   overwrite_with_best_model, 
   validation_frame, checkpoint,
   autoencoder = FALSE, 
   use_all_factor_levels = TRUE,
   activation = c("Rectifier", "Tanh", "TanhWithDropout",
   "RectifierWithDropout", "Maxout", "MaxoutWithDropout"), 
   hidden = c(200, 200), 
   epochs = 10, 
   train_samples_per_iteration = -2, 
   seed, 
   adaptive_rate=TRUE, 
   rho = 0.99, 
   epsilon = 1e-08, 
   rate = 0.005,
   rate_annealing = 1e-06, 
   rate_decay = 1, 
   momentum_start=0,
   momentum_ramp = 1e+06, 
   momentum_stable = 0,
   nesterov_accelerated_gradient = TRUE, 
   input_dropout_ratio=0, 
   hidden_dropout_ratios, 
   l1 = 0, l2 = 0, max_w2 = Inf,
   initial_weight_distribution=c("UniformAdaptive",
                                 "Uniform","Normal"),
   initial_weight_scale = 1, 
   loss = c("Automatic", "CrossEntropy", "MeanSquare", 
            "Absolute", "Huber"), 
   distribution = c("AUTO", "gaussian", "bernoulli", 
                    "multinomial", "poisson", "gamma", 
                    "tweedie", "laplace","huber"), 
   tweedie_power = 1.5, 
   score_interval = 5, 
   score_training_samples,
   score_validation_samples, 
   score_duty_cycle, 
   classification_stop,
   regression_stop, 
   quiet_mode, 
   max_confusion_matrix_size, 
   max_hit_ratio_k,
   balance_classes = FALSE, 
   class_sampling_factors, 
   max_after_balance_size,
   score_validation_sampling, 
   diagnostics, 
   variable_importances, 
   fast_mode, 
   ignore_const_cols,  
   force_load_balance, 
   replicate_training_data, 
   single_node_mode, 
   shuffle_training_data, 
   sparse, col_major,
   average_activation, 
   sparsity_beta, 
   max_categorical_features,
   reproducible = FALSE, 
   export_weights_and_biases = FALSE,
   offset_column = NULL, 
   weights_column = NULL, 
   nfolds = 0,
   fold_column = NULL, 
   fold_assignment = c("AUTO", "Random", "Modulo"),
   keep_cross_validation_predictions = FALSE, ...)

Use case

library(h2o)
# Initialize h2o with nthreads (default is 2)
localH2O <- h2o.init(nthreads = 4)
# Convert our datasets
iris.train.h2o <- as.h2o(iris.train, localH2O)
iris.test.h2o <- as.h2o(iris.test, localH2O)

# Run the model
model = h2o.deeplearning(x = colnames(iris)[-ncol(iris)],
                         y = "Species",
                         training_frame = iris.train.h2o,
                         activation = "Tanh",
                         hidden = c(10, 10, 10),
                         epochs = 10000)

# Check performance of test set
performance = h2o.performance(model = model, data=iris.test.h2o)

GPU computing for machine learning in R

Packages:

  • gputools
  • rpud

gputools

gpuGLM()
gpuLM()
gpuHclust(gpuDist())

rpud

http://www.r-tutor.com/gpu-computing

rpuHclust()
rvbm()
rhierLinearModel()
rpusvm()

Places to learn all about machine learning

Good reads